This notebook gives context for what strains are in our strain bank and shows that standard analyses don’t work for capturing genomic strain variation.
# read in databiobank =readh5mu(joinpath(bbdir, "BB669.h5mu"))biobank.obs[!,:kept_species] =convert.(Bool,biobank.obs[!,:kept_species]);biobank
┌ Warning: Cannot join columns with the same name because var_names are intersecting.
└ @ Muon /Users/bend/.julia/packages/Muon/eLqpV/src/mudata.jl:351
# we have taxanomic information each of these strains# GTDB_* are the annotations from https://gtdb.ecogenomic.org/# NCBI_* are the annotation post-submission to NCBI# * annotations are pre-submission to NCBI# donor specifies which human microbiome each strain was sampled from# <hospital>.<donor number>bbobs = biobank.obs;first(bbobs, 5)
5×22 DataFrame
Row
Strain_ID
Accession
Phylum
Class
Order
Family
Genus
Species
Donor
NCBI_Phylum
NCBI_Class
NCBI_Order
NCBI_Family
NCBI_Genus
NCBI_Species
GTDB_Phylum
GTDB_Class
GTDB_Order
GTDB_Family
GTDB_Genus
GTDB_Species
kept_species
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
String
Bool
1
MSK.16.19
JAHOLG000000000
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides eggerthii
MSK.16
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides eggerthii
Bacteroidota
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides eggerthii
false
2
MSK.10.5
JAHOMY000000000
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides fragilis
MSK.10
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides fragilis
Bacteroidota
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides fragilis_A
false
3
MSK.13.23
JAHOMK000000000
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides fragilis
MSK.13
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides fragilis
Bacteroidota
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides fragilis_A
false
4
MSK.16.61
JAHOKD000000000
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Phocaeicola
Phocaeicola vulgatus
MSK.16
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Phocaeicola
Phocaeicola vulgatus
Bacteroidota
Bacteroidia
Bacteroidales
Bacteroidaceae
Phocaeicola
Phocaeicola vulgatus
true
5
MSK.18.56
JAHOHX000000000
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides caccae
MSK.18
Bacteroidetes
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides caccae
Bacteroidota
Bacteroidia
Bacteroidales
Bacteroidaceae
Bacteroides
Bacteroides caccae
false
# concentrated for lachno's and bacteroidaceae # (matches distibutions normally found in these types of strain bank)family_counts =sort(countmap(bbobs.NCBI_Family), byvalue=true, rev=true)
proportion CSB strains from different NCBI families
species_counts =collect(values(countmap(bbobs.NCBI_Species)));histogram( species_counts, bins=100, xlabel="# of replicates", ylabel="# of species", size=(500,500),)
each bar represents the number of measured species (y axis) with a certain number of replicates (x axis). Most species we classify only have 1 strain representative, 11 species have at least 20 strain representatives.
Phyml on 16S alignment of Biobank strains
16S is a ribosomal protein (involved with translating RNA to Proteins) that is present across all bacteria. For that reason, it is commonly used as a gene marker for constructing ancestral trees of bacteria and taxanomic classification.
What we show is that 16S annotations are only good up to species level classification. At that point all strains are completely indistinguishable from each other.
Bac120 aimed to be an improvement over 16S annotation. It takes 120 genes that are reasonably well conserved across bacteria, and takes ~50 genomic positions from each of those genes, creating an annotation of around 6000 features.
We find that these annotations equally only work down to species level classification, with strain level replicates all being identically annotated.
However, these strains are not functionally identical. Strains of the same species, digest/produce different amounts of metabolite compounds. And it is not simply individual variation. In the plot below we see bi-modal distribution indicating that there are groups of strains that behave differently.
# load metabolites matrixmetmtx = biobank["metabolites_foldchange"].X[:, :][biobank.obs.kept_species,:];# too close to zero for detectable measurment, so assume no changemetmtx[isinf.(metmtx)] .=0.0; @showsize(metmtx)# filter to metabolites that have at least 10% detectable shifts in measurementmetabolite_names_full = biobank["metabolites_foldchange"].var_names.vals;keepmetabolites_mask =mapslices(c->mean(c .==0.0) <0.9, metmtx, dims=1) |> vec;metabolite_names = metabolite_names_full[keepmetabolites_mask]metmtx = metmtx[:, keepmetabolites_mask];@showsize(metmtx);
show log-2 fold change from base media across strains from 9 species with more than 20 replicates. multi-modal distributions for some distributions indicates sub-species phylogentic variation in phenotype.
Using the orthologous gene-group annotations can we represent these strains in a way that captures functional diversity?
We find that using traditional PCA fails to create this representation. Regardless of if we subset to narrow taxanomic groups, metabolic variation is never the principle axis of variation.
csb_mtx = biobank["oggs"].X[:,:]# find variable OGGsogg_mask =vec(var(csb_mtx, dims=1) .>0)# mean center variable OGGsmeancentered_csbmtx =mapslices(x->x.-mean(x), csb_mtx[:, ogg_mask], dims=1)# svd of mean centered matrixbbusv_csb =svd(meancentered_csbmtx);@showsize(meancentered_csbmtx);
size(meancentered_csbmtx) = (669, 8359)
# Principal Components from SVDpcs_csb = bbusv_csb.U *Diagonal(bbusv_csb.S);# percent of variance explained by each PCpctvar = (bbusv_csb.S .^2/sum(bbusv_csb.S .^2)) *100;
map colors to NCBI families in correct order for plotting
familyid = bbobs.NCBI_Familyorderedfamilylabels =stack(DataFrame(countmap(familyid)), 1:12) |> df -> DataFrames.transform(df, :value => (-) =>:minusvalue) |> df ->sort(df, [:minusvalue, :variable], rev=false) |> df -> df.variablefamilycolors =permutedims(palette(:Set3_12).colors.colors[indexin(sort(unique(familyid)), orderedfamilylabels)]);# collect Butyrate and Succinate values for colormapbutyrate_foldchange_per_strain = biobank["metabolites_foldchange"].X[:, findfirst(==("Butyrate"), biobank["metabolites_foldchange"].var.label)];succinate_foldchange_per_strain = biobank["metabolites_foldchange"].X[:, findfirst(==("Succinate"), biobank["metabolites_foldchange"].var.label)];# get color limits based on most extreme valuebutyrate_lims =getlims(butyrate_foldchange_per_strain)succinate_lims =getlims(succinate_foldchange_per_strain);
The principal axis of variation across the strain bank is phylum, within each phyla there is no pattern to which strains produce/digest metabolites.
Each sub-plot shows 669 CSB strains on PC 1 (x axis) and PC 2 (y axis). Aqua dots correspond to strains belonging to the species listed in each sub-title followed by the number of strains belonging to that species. Remaining strains are shown in grey.
csb_mtx = biobank["oggs"].X[:,:][mask, :]# find variable OGGsogg_mask =vec(var(csb_mtx, dims=1) .>0)# mean center variable OGGsmeancentered_csbmtx =mapslices(x->x.-mean(x), csb_mtx[:, ogg_mask], dims=1)# svd of mean centered matrixbbusv_csb =svd(meancentered_csbmtx);@showsize(meancentered_csbmtx);
size(meancentered_csbmtx) = (229, 3755)
# Principal Components from SVDpcs_csb = bbusv_csb.U *Diagonal(bbusv_csb.S);# percent of variance explained by each PCpctvar = (bbusv_csb.S .^2/sum(bbusv_csb.S .^2)) *100;
map colors to NCBI Species in correct order for plotting
genusid = biobank.obs.NCBI_Genus[mask]# collect Butyrate and Succinate values for colormapbutyrate_foldchange_per_strain = biobank["metabolites_foldchange"].X[:, findfirst(==("Butyrate"), biobank["metabolites_foldchange"].var.label)][mask, :];succinate_foldchange_per_strain = biobank["metabolites_foldchange"].X[:, findfirst(==("Succinate"), biobank["metabolites_foldchange"].var.label)][mask, :];
csb_mtx = biobank["oggs"].X[:,:][mask, :]# find variable OGGsogg_mask =vec(var(csb_mtx, dims=1) .>0)# mean center variable OGGsmeancentered_csbmtx =mapslices(x->x.-mean(x), csb_mtx[:, ogg_mask], dims=1)# svd of mean centered matrixbbusv_csb =svd(meancentered_csbmtx);@showsize(meancentered_csbmtx);
size(meancentered_csbmtx) = (103, 2065)
# Principal Components from SVDpcs_csb = bbusv_csb.U *Diagonal(bbusv_csb.S);# percent of variance explained by each PCpctvar = (bbusv_csb.S .^2/sum(bbusv_csb.S .^2)) *100;
map colors to NCBI Species in correct order for plotting
speciesid = biobank.obs.NCBI_Species[mask]# collect Butyrate and Succinate values for colormapbutyrate_foldchange_per_strain = biobank["metabolites_foldchange"].X[:, findfirst(==("Butyrate"), biobank["metabolites_foldchange"].var.label)][mask, :];succinate_foldchange_per_strain = biobank["metabolites_foldchange"].X[:, findfirst(==("Succinate"), biobank["metabolites_foldchange"].var.label)][mask, :];
Same story subset to the level of an individual species.
Even here where we have introduced a large amount of constraint in the type of variation in our dataset, metabolite concentrations are not what is varying by these principal axes of variation. Meaning we can find many cases where strains are overlaid on top of each other on these axes, yet still have very different behavior either eating or producing metabolites. Similarly, on either side of the plot, we can see strains that produce/consume metabolites at similar levels.
csb_mtx = biobank["oggs"].X[:,:][mask, :]# find variable OGGsogg_mask =vec(var(csb_mtx, dims=1) .>0)# mean center variable OGGsmeancentered_csbmtx =mapslices(x->x.-mean(x), csb_mtx[:, ogg_mask], dims=1)# svd of mean centered matrixbbusv_csb =svd(meancentered_csbmtx);@showsize(meancentered_csbmtx);
size(meancentered_csbmtx) = (93, 1551)
# Principal Components from SVDpcs_csb = bbusv_csb.U *Diagonal(bbusv_csb.S);# percent of variance explained by each PCpctvar = (bbusv_csb.S .^2/sum(bbusv_csb.S .^2)) *100;
map colors to NCBI Species in correct order for plotting
speciesid = biobank.obs.NCBI_Species[mask]# collect Butyrate and Succinate values for colormapbutyrate_foldchange_per_strain = biobank["metabolites_foldchange"].X[:, findfirst(==("Butyrate"), biobank["metabolites_foldchange"].var.label)][mask, :];succinate_foldchange_per_strain = biobank["metabolites_foldchange"].X[:, findfirst(==("Succinate"), biobank["metabolites_foldchange"].var.label)][mask, :];
UMAP doesn’t help. If we use 10 principal components, and compress with UMAP into 2 dimensions we get the same story. Species are compressed into nearly single points, and the metabolite variability contained within is lost.
Calcuate PCs on full CSB
theme(:default, grid=false, framestyle=:box, label="", ratio=1)csb_mtx = biobank["oggs"].X[:,:]# find variable OGGsogg_mask =vec(var(csb_mtx, dims=1) .>0)# mean center variable OGGsmeancentered_csbmtx =mapslices(x->x.-mean(x), csb_mtx[:, ogg_mask], dims=1)# svd of mean centered matrixbbusv_csb =svd(meancentered_csbmtx);@showsize(meancentered_csbmtx);# Principal Components from SVDpcs_csb = bbusv_csb.U *Diagonal(bbusv_csb.S);# percent of variance explained by each PCpctvar = (bbusv_csb.S .^2/sum(bbusv_csb.S .^2)) *100;familyid = bbobs.NCBI_Familyorderedfamilylabels =stack(DataFrame(countmap(familyid)), 1:12) |> df -> DataFrames.transform(df, :value => (-) =>:minusvalue) |> df ->sort(df, [:minusvalue, :variable], rev=false) |> df -> df.variablefamilycolors =permutedims(palette(:Set3_12).colors.colors[indexin(sort(unique(familyid)), orderedfamilylabels)]);
Each sub-plot shows 669 CSB strains on UMAP 1 (x axis) and UMAP 2 (y axis) generated from euclidean distance across the leading 10 PCs and 77 shared nearest neighbors. Aqua dots correspond to strains belonging to the species listed in each sub-title followed by the number of strains belonging to that species. Remaining strains are shown in grey.